Fast Construction of a Word-Number Index for Large Data
Abstract
This paper presents work in progress with promising results. We propose a new method for constructing word-to-number and number-to-word indices for very large corpora (tens of billions of tokens) that is up to an order of magnitude faster than the current approach. We use a HAT-trie to sort the data and Daciuk’s algorithm to build a minimal deterministic finite-state automaton from the sorted data. We reimplemented the latter; our implementation is roughly three times faster than Daciuk’s and has a smaller memory footprint. The result is useful not only for building word↔number indices, but also for many other applications, e.g. building data for morphological analysers.
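To make the idea concrete, here is a minimal Python sketch of building a minimal acyclic automaton incrementally from sorted words, in the spirit of the sorted-input construction attributed to Daciuk. It is not the authors' implementation: Python's built-in `sorted` stands in for the HAT-trie, and all names (`State`, `build_minimal_dfa`, `word_to_number`, `number_to_word`) are hypothetical. Deriving both mappings by storing in each state the number of words reachable from it is one common technique, and it is an assumption here that the paper's index works along these lines.

```python
class State:
    """One automaton state; edges are added in sorted character order."""
    __slots__ = ("edges", "final", "count")

    def __init__(self):
        self.edges = {}      # char -> State (dict keeps insertion order)
        self.final = False   # True if a word ends in this state
        self.count = None    # number of words accepted from this state

    def signature(self):
        # Two states are equivalent if they agree on finality and on
        # (char, target) pairs; targets are already canonical states.
        return (self.final, tuple((c, id(t)) for c, t in self.edges.items()))


def _minimize(unchecked, register, down_to):
    # Replace states on the previous word's path by registered equivalents.
    while len(unchecked) > down_to:
        parent, ch, child = unchecked.pop()
        parent.edges[ch] = register.setdefault(child.signature(), child)


def _count(state):
    # Memoized count of words reachable from `state` (the graph is a DAG).
    if state.count is None:
        state.count = int(state.final) + sum(_count(t) for t in state.edges.values())
    return state.count


def build_minimal_dfa(sorted_words):
    """Build the automaton from strictly increasing (sorted, unique) words."""
    root, register, unchecked, previous = State(), {}, [], None
    for word in sorted_words:
        if previous is not None and word <= previous:
            raise ValueError("input must be sorted and free of duplicates")
        prev = previous or ""
        common = 0
        while common < min(len(word), len(prev)) and word[common] == prev[common]:
            common += 1
        _minimize(unchecked, register, common)
        node = unchecked[-1][2] if unchecked else root
        for ch in word[common:]:
            child = State()
            node.edges[ch] = child
            unchecked.append((node, ch, child))
            node = child
        node.final = True
        previous = word
    _minimize(unchecked, register, 0)
    _count(root)
    return root


def word_to_number(root, word):
    """Return the rank of `word` in sorted order, or None if it is absent."""
    index, node = 0, root
    for ch in word:
        if ch not in node.edges:
            return None
        if node.final:
            index += 1               # the shorter word ending here sorts first
        for c, target in node.edges.items():
            if c == ch:
                break
            index += target.count    # skip all words through smaller edges
        node = node.edges[ch]
    return index if node.final else None


def number_to_word(root, number):
    """Inverse of word_to_number; returns None if `number` is out of range."""
    if not 0 <= number < root.count:
        return None
    node, chars = root, []
    while True:
        if node.final:
            if number == 0:
                return "".join(chars)
            number -= 1
        for c, target in node.edges.items():
            if number < target.count:
                chars.append(c)
                node = target
                break
            number -= target.count
```

With the words `car`, `cars`, `cat` as input, the states reached after `cars` and `cat` are merged into one, and each word's number is simply its position in the sorted list. A production implementation for tens of billions of tokens would need a far more compact state representation than Python objects; the sketch only illustrates the construction and the two lookups.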